Introduction to Bayesian Models

Steve Elston

10/13/2022

Review

- The concepts of likelihood and maximum likelihood estimation (MLE) have been at the core of much of statistical modeling for about 100 years
- Statistical inference seeks to characterize the uncertainty in statistical point estimates
- Nonparametric bootstrap estimation is widely useful and requires minimal assumptions
- There are several variations of the basic nonparametric bootstrap algorithm
- Re-sampling methods are general and powerful, but there is no magic involved! There are pitfalls!

Introduction to Bayesian Models

Despite this long history, Bayesian models were not used extensively until recently

Introduction

Bayesian analysis stands in contrast to frequentist methods

Bayesian Model Use Case

Bayesian methods made global headlines with the successful location of the missing Air France Flight 447

Posterior distribution of locations of Air France 447

Bayesian Model Use Case

Kratzke, Stone, and Frost developed an optimal search planner for missing aircraft using a Bayesian model

Screen shot from USCG search planner

Bayesian vs. Frequentist Views

With greater computational power and general acceptance, Bayes methods are now widely used

Bayesian vs. Frequentist Views

We can compare the contrasting frequentist and Bayesian approaches

Comparison of frequentist and Bayes methods

Review of Bayes Theorem

Bayes’ Theorem is fundamental to Bayesian data analysis.

\[P(A \cap B) = P(A|B) P(B) \]

We can also write:

\[P(A \cap B) = P(B|A) P(A) \]

Eliminating \(P(A \cap B)\):

\[ P(B)P(A|B) = P(A)P(B|A)\]

And finally, Bayes theorem!

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
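As a quick numeric sanity check of the derivation above, here is a short sketch with purely illustrative probabilities (the values of \(P(A)\), \(P(B)\), and \(P(B|A)\) are made up):

```python
# Hypothetical probabilities chosen for illustration only
p_a = 0.3          # P(A)
p_b = 0.4          # P(B)
p_b_given_a = 0.6  # P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

# Both factorizations of the joint probability P(A and B) must agree
joint_via_a = p_b_given_a * p_a
joint_via_b = p_a_given_b * p_b
```

The equality of the two joint-probability factorizations is exactly the step used to eliminate \(P(A \cap B)\) above.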

Bayes Theorem

Bayes Theorem!

Marginal Distributions

In many cases we are interested in the marginal distribution

\[p(\theta_1) = \int_{\theta_2, \ldots, \theta_n} p(\theta_1, \theta_2, \ldots, \theta_n)\ d\theta_2, \ldots, d\theta_n\]

- But computing this integral is not easy!

Marginal Distributions

\[ p(\theta) = \sum_{\mathbf{x} \in \mathbf{X}} p(\theta | \mathbf{x})\ p(\mathbf{x}) \]

\[ p(\mathbf{X}) = \sum_{\theta \in \Theta} p(\mathbf{X} |\theta) p(\theta) \]
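For a fully discrete problem these sums can be computed directly. The following sketch uses a hypothetical two-value parameter and three data outcomes (all numbers are illustrative) to show both marginalizations:

```python
# Hypothetical discrete example: two parameter values, three data outcomes
prior = {0.2: 0.5, 0.8: 0.5}                         # p(theta)
lik = {0.2: [0.7, 0.2, 0.1], 0.8: [0.1, 0.3, 0.6]}   # p(x | theta), x = 0, 1, 2

# Evidence by marginalizing over theta: p(x) = sum_theta p(x | theta) p(theta)
p_x = [sum(lik[t][x] * prior[t] for t in prior) for x in range(3)]

# Posterior by Bayes' theorem: p(theta | x) = p(x | theta) p(theta) / p(x)
post = {t: [lik[t][x] * prior[t] / p_x[x] for x in range(3)] for t in prior}

# Marginalizing the posterior over x recovers the prior:
# p(theta) = sum_x p(theta | x) p(x)
recovered = {t: sum(post[t][x] * p_x[x] for x in range(3)) for t in prior}
```

For continuous parameters the sums become the integrals shown above, which is where the computation becomes hard.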

Interpreting Bayes Theorem

How can you interpret Bayes’ Theorem?

\[Posterior\ Distribution = \frac{Likelihood \bullet Prior\ Distribution}{Evidence} \]

\[ posterior\ distribution(parameters\ |\ data) = \\ \frac{Likelihood(data\ |\ parameters)\ Prior(parameters)}{P(data)} \]

\[ P(parameters\ |\ data) = \frac{P(data\ |\ parameters)\ P(parameters)}{P(data)} \]

Interpreting Bayes Theorem

What do these terms actually mean?

  1. Posterior distribution of the parameters given the evidence or data, the goal of Bayesian analysis

  2. Prior distribution is chosen to express information available about the model parameters a priori

  3. Likelihood is the conditional distribution of the data given the model parameters

  4. Probability of Data or evidence is the distribution of the data and normalizes the posterior

These relationships apply to any parameters in a model: partial slopes, intercepts, error distributions, lasso constants, etc.

Applying Bayes Theorem

We need a tractable formulation of Bayes Theorem for computational problems

\[ P(B \cap A) = P(B|A)P(A) \\ And \\ P(B) = P(B \cap A) + P(B \cap \bar{A}) \]

Where, \(\bar{A} = not\ A\), and the marginal distribution, \(P(B)\), can be written:

\[ P(B) = P(B|A)P(A) + P(B|\bar{A})P(\bar{A}) \]

Applying Bayes Theorem

Using the foregoing relations we can rewrite Bayes Theorem as:

\[ P(A|B) = \frac{P(A)P(B|A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})} \]

Absorbing the denominator into a normalization constant \(k\), we can write Bayes Theorem as:

\[P(A|B) = k \cdot P(B|A)P(A)\]

Ignoring the normalization constant \(k\):

\[P(A|B) \propto P(B|A)P(A)\]
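To make the total-probability denominator and the normalization constant \(k\) concrete, here is a small numeric sketch; the values of \(P(A)\), \(P(B|A)\), and \(P(B|\bar{A})\) are purely illustrative:

```python
# Hypothetical inputs: a prior P(A) and the two conditionals
p_a = 0.01
p_b_given_a = 0.95
p_b_given_not_a = 0.10

# Denominator by total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem with the expanded denominator
p_a_given_b = p_b_given_a * p_a / p_b

# Equivalently: form the unnormalized products and normalize at the end,
# which is the role of the constant k = 1 / P(B)
unnorm = [p_b_given_a * p_a, p_b_given_not_a * (1 - p_a)]
k = 1 / sum(unnorm)
```

Normalizing the unnormalized product at the end is exactly what the proportional form of Bayes Theorem exploits.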

Interpreting Bayes Theorem

Denominator must account for all possible outcomes, or alternative hypotheses, \(h'\):

\[Posterior(hypothesis\ |\ evidence) =\\ \frac{Likelihood(evidence\ |\ hypothesis)\ prior(hypothesis)}{\sum_{ h' \in\ All\ possible\ hypotheses}Likelihood(evidence\ |\ h')\ prior(h')}\]

Simplified Relationship for Bayes Theorem

How do we interpret the foregoing relationship?

\[Posterior\ Distribution \propto Likelihood \bullet Prior\ Distribution \\ Or \\ P(parameters\ |\ data) \propto P(data\ |\ parameters)\ P(parameters) \]

Creating Bayes models

The goal of a Bayesian analysis is computing and performing inference on the posterior distribution of the model parameters

The general steps are as follows:

  1. Identify data relevant to the research question

  2. Define a sampling plan for the data. Data need not be collected in a single batch

  3. Define the model and the likelihood function; e.g. regression model with Normal likelihood

  4. Specify a prior distribution of the model parameters

  5. Use the Bayesian inference formula to compute posterior distribution of the model parameters

  6. Update the posterior as data is observed

  7. Perform inference on the posterior; e.g. compute credible intervals

  8. Optionally, simulate data values from realizations of the posterior distribution. These values are predictions from the model.

Updating Bayesian Models

An advantage of Bayesian models is that they can be updated as new observations are made

How can you choose a prior?

The choice of the prior is a difficult, and potentially vexing, problem when performing Bayesian analysis

How can you choose a prior?

Some possible approaches to prior selection include:

How can you choose a prior?

Prior empirical information can be used to estimate the parameters of the prior distribution

Conjugate Prior Distributions

An analytically and computationally simple choice for a prior distribution family is a conjugate prior

Conjugate Prior Distributions

Most commonly used distributions have conjugates, with a few examples:

| Likelihood | Conjugate |
|------------|-----------|
| Binomial | Beta |
| Bernoulli | Beta |
| Poisson | Gamma |
| Categorical | Dirichlet |
| Normal - mean | Normal |
| Normal - variance, \(\chi^2\) | Inverse Gamma |
| Normal - inverse variance, \(\tau\) | Gamma |

Example using Conjugate Distribution

We are interested in analyzing the incidence of distracted drivers

\[ P(k) = \binom{n}{k} \cdot \theta^k(1-\theta)^{n-k}\]
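The Binomial likelihood above can be computed directly with the standard library; the sample values below (10 distracted drivers out of 40, \(\theta = 0.25\)) are illustrative:

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(k successes in n trials with success probability theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# e.g. probability of observing 10 distracted drivers out of 40 if theta = 0.25
p = binom_pmf(10, 40, 0.25)
```

Evaluated over a grid of \(\theta\) values for fixed data, this same function serves as the likelihood in the Bayesian computations that follow.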

Working with Conjugate Distribution

Our process for example is:

  1. Use the conjugate prior, the Beta distribution with parameters \(\alpha\) and \(\beta\) (or a,b)
  2. Using the data sample, compute the likelihood
  3. Compute the posterior distribution of distracted driving
  4. Add more evidence (data) and update the posterior distribution.
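The conjugate update in steps 1-4 reduces to simple addition of counts. A minimal sketch (the prior \(Beta(2, 10)\) and the counts match the worked example on these slides; the helper name is my own):

```python
def update_beta(a, b, successes, failures):
    """Beta-Binomial conjugate update: Beta(a, b) prior with z successes
    and n - z failures gives a Beta(a + z, b + (n - z)) posterior."""
    return a + successes, b + failures

# Prior Beta(2, 10); evidence: 10 successes, 30 failures
a, b = update_beta(2, 10, 10, 30)

# Step 4: updating in two batches gives the same posterior as one batch
a2, b2 = update_beta(*update_beta(2, 10, 4, 12), 6, 18)
```

That batch/sequential equivalence is what makes conjugate models so convenient for updating as new evidence arrives.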

Example using Conjugate Distribution

What are the properties of the Beta distribution?

Beta distribution for different parameter values

Example using Conjugate Distribution

Consider the product of a Binomial likelihood and a Beta prior

\[\begin{align} posterior(\theta | z, n) &= \frac{likelihood(z,n | \theta)\ prior(\theta)}{data\ distribution (z,n)} \\ p(\theta | z, n) &= \frac{Binomial(z,n | \theta)\ Beta(\theta)}{p(z,n)} \\ &= Beta(z + a,\ n-z+b) \end{align}\]

Example using Conjugate Distribution

There are some useful insights you can gain from this relationship for (discrete) integer counts:

\[ posterior(\theta | z, n) = Beta(z + a,\ n-z+b) \]

- Evidence is in the form of (actual) counts of successes, \(z\), and failures, \(n-z\)
- The more evidence, the greater the influence on the posterior distribution
- A large amount of evidence will overwhelm the prior
- With a large amount of evidence, the posterior converges to the frequentist model
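The convergence to the frequentist estimate is easy to see from the posterior mean \((z + a)/(n + a + b)\), which approaches the MLE \(z/n\) as \(n\) grows. A small sketch, holding the observed success fraction fixed at an illustrative 0.25:

```python
a, b = 2.0, 10.0     # prior pseudo-counts, as in the example on these slides
theta_hat = 0.25     # observed success fraction z / n (held fixed as n grows)

# Posterior mean (z + a) / (n + a + b) for increasing sample sizes
post_means = []
for n in [10, 100, 10000]:
    z = theta_hat * n                       # idealized counts at that fraction
    post_means.append((z + a) / (n + a + b))
```

With only 10 observations the prior pulls the posterior mean well below 0.25; with 10,000 it is essentially the MLE.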

Example using Conjugate Distribution

Consider an example with:
- Prior pseudo counts \([1,9]\), successes \(a = 1 + 1\) and failures, \(b = 9 + 1\)
- Evidence, successes \(= 10\) and failures, \(= 30\)
- Posterior is \(Beta(10 + 2,\ 40 - 10 + 10) = Beta(12,\ 40)\)

Prior, likelihood and posterior for distracted driving

Sampling the Posterior

How can we find an estimate of the posterior distribution?

  1. We can sample from the analytic solution - if we have a conjugate

  2. We can sample the likelihood and prior, take the product and normalize - for any posterior

  3. Grid sample or Markov chain Monte Carlo (MCMC) sample

Sampling the Posterior

Grid sampling is a naive approach

Sampling grid for bivariate distribution

Sampling the Posterior

Algorithm for grid sampling to compute posterior from likelihood and prior

Procedure CreateGrid(variables, lower_limits, upper_limits):
    # Build the sampling grid
    return sampling_grid

Procedure SampleLikelihood(sampling_value, observation_values):
    return likelihood_function(sampling_value, observation_values)

Procedure Prior(sampling_value, prior_parameter_values):
    return prior_density_function(sampling_value, prior_parameter_values)

Procedure ComputePosterior(variables, lower_limits, upper_limits):
    # Initialize the sampling grid
    Grid = CreateGrid(variables, lower_limits, upper_limits)

    # Initialize array to hold sampled posterior values
    array posterior[range(Grid)]

    # Compute unnormalized posterior at each sampling value in the grid
    for sampling_value in Grid:
        likelihood = SampleLikelihood(sampling_value, observation_values)
        prior = Prior(sampling_value, prior_parameter_values)
        posterior[sampling_value] = likelihood * prior

    # Normalize the posterior
    probability_data = sum(posterior[range(Grid)])
    posterior = posterior[range(Grid)] / probability_data
    return posterior
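A runnable one-dimensional version of this algorithm, applied to the distracted-driver example (Binomial likelihood, Beta prior), might look like the following sketch:

```python
from math import comb

def grid_posterior(z, n, a, b, n_grid=1000):
    """Grid-sample the posterior over theta for a Binomial likelihood
    with z successes in n trials and a Beta(a, b) prior."""
    # Grid of theta values at cell midpoints in (0, 1)
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    # Likelihood and (unnormalized) prior density at each grid point
    lik = [comb(n, z) * t**z * (1 - t)**(n - z) for t in grid]
    prior = [t**(a - 1) * (1 - t)**(b - 1) for t in grid]
    # Unnormalized posterior, then normalize over the grid
    unnorm = [l * p for l, p in zip(lik, prior)]
    total = sum(unnorm)
    return grid, [u / total for u in unnorm]

grid, post = grid_posterior(z=10, n=40, a=2, b=10)
post_mean = sum(t * p for t, p in zip(grid, post))
```

Because this example is conjugate, the grid result can be checked against the exact \(Beta(12, 40)\) posterior; for non-conjugate models the grid (or MCMC) is the only option.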

Credible Intervals

How can we specify the uncertainty for a Bayesian parameter estimate?

Credible Intervals

What are the 95% credible intervals for \(Beta(12,\ 40)\)?

Probability of distracted drivers for the next 10 cars

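One simple way to approximate the 95% credible interval for \(Beta(12, 40)\) is Monte Carlo: draw from the posterior and take empirical quantiles. A sketch using only the standard library:

```python
import random

random.seed(42)  # for a reproducible illustration

# Draw a large sample from the posterior Beta(12, 40) and take the
# empirical 2.5% and 97.5% quantiles as an approximate 95% credible interval
draws = sorted(random.betavariate(12, 40) for _ in range(100_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
```

With a conjugate posterior the exact quantiles are also available analytically (e.g. via a Beta quantile function), but the Monte Carlo approach generalizes to any posterior you can sample.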
Credible Intervals are not Confidence Intervals

How are credible intervals different from the more familiar confidence intervals?

Confidence intervals and credible intervals are conceptually quite different

A confidence interval is a purely frequentist concept
- It is an interval on the sampling distribution within which repeated sample estimates of a statistic are expected to fall with probability \(\alpha\)
- A confidence interval cannot be interpreted as an interval on a probability distribution of the value of a statistic!

A credible interval is an interval on the posterior distribution of the statistic
- The credible interval is exactly what the common misinterpretation of the confidence interval tries to be
- The credible interval is the interval containing the statistic being estimated with probability \(\alpha\)

For symmetric posterior distributions, the credible interval will be numerically the same as the confidence interval
- This need not be the case in general

Credible Intervals are not Confidence Intervals

Compare confidence interval and credible interval for the case of 10 observations

Difference between credible and confidence intervals

Difference between credible and confidence intervals
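The numerical difference is easy to reproduce for the distracted-driver counts (10 of 40). The sketch below compares a frequentist 95% Wald confidence interval with a Monte Carlo 95% credible interval from the \(Beta(12, 40)\) posterior; the Wald formula is one common frequentist choice, not the only one:

```python
import random
from math import sqrt

z, n = 10, 40
p_hat = z / n

# Frequentist 95% Wald confidence interval for a proportion
half = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
conf_int = (p_hat - half, p_hat + half)

# Bayesian 95% credible interval from the Beta(12, 40) posterior (Monte Carlo)
random.seed(0)
draws = sorted(random.betavariate(12, 40) for _ in range(100_000))
cred_int = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])
```

With this informative prior and modest sample size the two intervals are centered differently and have different widths, illustrating that they answer different questions.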

Simulating from the posterior distribution: predictions

What else can we do with a Bayesian posterior distribution beyond credible intervals?

Simulating from the posterior distribution: predictions

Example: what are the probabilities of distracted drivers for the next 10 cars, with posterior \(Beta(12,\ 40)\)?

Probability of distracted drivers for the next 10 cars
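Such predictions come from the posterior predictive distribution: draw \(\theta\) from the posterior, then simulate data given that \(\theta\). A Monte Carlo sketch for the next 10 cars, using only the standard library:

```python
import random

random.seed(1)  # for a reproducible illustration

# Posterior predictive for the number of distracted drivers among the
# next 10 cars: integrate over theta ~ Beta(12, 40) by Monte Carlo
n_next, n_sims = 10, 100_000
counts = [0] * (n_next + 1)
for _ in range(n_sims):
    theta = random.betavariate(12, 40)                        # one posterior draw
    k = sum(random.random() < theta for _ in range(n_next))   # Binomial(10, theta)
    counts[k] += 1
pred = [c / n_sims for c in counts]       # predictive probabilities for k = 0..10
pred_mean = sum(k * p for k, p in enumerate(pred))
```

Averaging over posterior draws of \(\theta\) rather than plugging in a point estimate spreads the predictive distribution, reflecting the remaining parameter uncertainty.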

Summary

Bayesian analysis stands in contrast to frequentist methods
